Location Name Extraction from Targeted Text Streams using Gazetteer-based Statistical Language Models

Authors

Hussein Al-Olimat

Krishnaprasad Thirunarayan

Valerie L. Shalin

Amit P. Sheth

Abstract

Extracting location names from informal and unstructured texts requires the identification of referent boundaries and partitioning of compound names in the presence of variation in location referents. Instead of analyzing semantic, syntactic, and/or orthographic features, our Location Name Extraction tool (LNEx) exploits a region-specific statistical language model to evaluate an observed n-gram in Twitter targeted text as a legitimate location name variant. LNEx handles abbreviations, and automatically filters and augments the location names in gazetteers from OpenStreetMap, Geonames, and DBpedia. Consistent with Carroll [4], LNEx addresses two kinds of location name contractions: category ellipsis and location ellipsis, which produce alternate name forms of location names (i.e., Nameheads of location names). The modified gazetteers and dictionaries of abbreviations help detect the boundaries of multi-word location names, delimiting them in texts using n-gram statistics. We evaluated the extent to which using an augmented and filtered region-specific gazetteer can successfully extract location names from a targeted text stream. We used 4,500 event-specific tweets from three targeted streams of different flooding disasters to compare LNEx performance against eight state-of-the-art taggers. LNEx improved the average F-Score by 98-145%, outperforming these taggers convincingly on the three manually annotated Twitter streams. Furthermore, LNEx is capable of stream processing.
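The idea of matching tweet n-grams against an augmented gazetteer can be illustrated with a minimal sketch. It is not the LNEx implementation: it replaces the statistical language model with a plain greedy longest-match lookup, and it assumes one possible reading of category ellipsis, namely dropping a trailing category word. The names `CATEGORY_WORDS`, `augment`, and `extract`, the category list, and the place names are all hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical category words, chosen only for illustration.
CATEGORY_WORDS = {"school", "hospital", "bridge", "station"}


def augment(names: Iterable[str]) -> Set[Tuple[str, ...]]:
    """Return lower-cased token tuples for each name plus a shortened variant."""
    variants: Set[Tuple[str, ...]] = set()
    for name in names:
        tokens = tuple(name.lower().split())
        if not tokens:
            continue
        variants.add(tokens)
        # Assumed form of category ellipsis: drop a trailing category word.
        if len(tokens) > 1 and tokens[-1] in CATEGORY_WORDS:
            variants.add(tokens[:-1])
    return variants


def extract(text: str, gazetteer: Set[Tuple[str, ...]]) -> List[str]:
    """Greedy longest-match scan of the text's n-grams against the gazetteer."""
    tokens = [t.strip(".,!?;:#@").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    max_n = max((len(g) for g in gazetteer), default=0)
    found: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate n-gram first, then shorter ones.
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in gazetteer:
                found.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            i += 1  # no match starts at this token
    return found


gazetteer = augment(["Oak Street", "Millbrook School"])
print(extract("Flooding near Millbrook and Oak Street!", gazetteer))
```

Trying the longest n-gram first is what keeps a multi-word name such as "Oak Street" from being split into shorter matches; a real system would also need to score ambiguous candidates, which this sketch does not attempt.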

Similar resources

N-gram and Gazetteer List Based Named Entity Recognition for Urdu: A Scarce Resourced Language

Extraction of named entities (NEs) from text is an important operation in many natural language processing applications such as information extraction, question answering, and machine translation. Since the early 1990s, researchers have taken greater interest in this field, and much work has been done on Named Entity Recognition (NER) in different languages of the world. Unfortunately...

Ad-Hoc Georeferencing of Web-Pages Using Street-Name Prefix Trees

A bottleneck in constructing location-based web searches is that most web pages do not contain any explicit geocoding such as geotags. An alternative solution can be based on ad-hoc georeferencing, which relies on street addresses, but the problem is how to extract and validate the address strings from free-form text. We propose a rule-based solution that detects address-based locations using a gaz...

A Novel Approach to Conditional Random Field-based Named Entity Recognition using Persian Specific Features

Named Entity Recognition is an information extraction technique that identifies named entities in a text. Three methods have conventionally been used to extract named entities from a text: rule-based, machine-learning-based, and hybrids of the two. Machine-learning-based methods perform well in the Persian language if they are trained with good features. To get good performanc...

A New Method for Named Entity Extraction in Classical Arabic

In Natural Language Processing (NLP) studies, developing resources and tools contributes to the extension and effectiveness of research in each language. In recent years, Arabic Named Entity Recognition (ANER) has attracted the attention of NLP researchers due to its significant impact on improving other NLP tasks such as machine translation, information retrieval, question answering, query result...

A New Text-Mining Method for Extracting User Context Information to Improve Search Engine Result Ranking

Today, the importance of text processing and its uses is well known among researchers and students. The amount of textual material increases day by day, so we need useful ways to store it and retrieve information from it. For example, search engines such as Google, Yahoo, and Bing need to read a great many web documents and retrieve the ones most similar to the user ...

Journal title:

CoRR

Volume abs/1708.03105

Publication date: 2017
